From the data set provided below, concerning the energy consumption of a retail store in 2014, we build a timeseries model to predict the consumed energy (kWh) made within a specific time frame. We provide both a short-term (60 & 90 calendar days) and a long-term forecast (semester & one year prediction). The simulated time series of energy consumption which is returned by the model is surprisingly similar with the one provided, except some peaks during the hot summer days and the New Year's Eve. At these times, the forecasted consumption is significantly lower, than the one had during the same periods in 2014. Having historical data of two and more years, would allow us to better incorporate these seasonalities in our model and hopefully provide even better predictions.
In another front, we compare our results with the one returned by the FB prophet library, as it is was trained over the Energy Consumption time series (kWh) alone. We find that this model also provide good results, although shifted towards greater values of forecasted consumed energy than the one had during the same periods in 2014. We believe that having the ability to incorporate the 'kWh' ~ 'Temp' & 'Traffic' dependence in this model, would make it much better.
Dataset consumption.csv contains all signals of a smart energy meter that it is plugged in the store, along with information about weather conditions of the area and the traffic of the store, between dates '2014-01-01' and '2014-12-31'.
The file is in comma separated format and contains the following variables:
'signal_id': unique id of every signal of the smart meter'kW': the consumption that the signal transmits the given moment (have in mind: kWh = kW * h).'temperature': the external temperature on the given moment in Celsius'traffic': a category of the amount of traffic in the store the given moment (0: no traffic/closed - 4: peak)'timestamp': the timestamp of the signal
In [1]:
# R LANG::
# Define where my R libraries reside
.libPaths(c('/home/theod/R/x86_64-redhat-linux-gnu-library/3.3', '/usr/lib64/R/library/'))
# Load required libraries
library('astsa')
library('mgcv')
library('dplyr')
library('zoo')
library('ggplot2')
library('grid')
library('reshape2')
library('prophet')
library('rstan')
# Globally set the ggplot2 theme options of preference
old_theme <- theme_get()
theme_set(theme_bw())
theme_update(plot.title=element_text(face='bold'))
In [3]:
# R LANG::
# Load the data set
consumpt_df <- read.csv('./02.consumption.csv')
consumpt_df$timestamp <- as.POSIXct(consumpt_df$timestamp, '%Y-%m-%d %H:%M:%S')
In [4]:
# R LANG::
head(consumpt_df)
In [5]:
# R LANG::
# DataFrame structure
str(consumpt_df)
In [6]:
# R LANG::
summary(consumpt_df)
In [7]:
# R LANG::
# Check the uniqueness of key in data set
num_unique_values = length(unique(consumpt_df$signal_id))
num_records = length(consumpt_df$signal_id)
is_key = (num_unique_values == num_records)
paste('Checks the uniqueness of "signal_id" [key]:', is_key)
In [8]:
# R LANG::
# Check the uniqueness of key in data set
num_unique_values = length(unique(consumpt_df$timestamp))
num_records = length(consumpt_df$timestamp)
is_key = (num_unique_values == num_records)
paste('Checks the uniqueness of "timestamp" [key]:', is_key)
In [9]:
# R LANG::
summary(consumpt_df$timestamp)
In [10]:
# R LANG::
str(consumpt_df$traffic)
In [11]:
# R LANG::
summary(consumpt_df$traffic)
In [12]:
# R LANG::
summary(consumpt_df$kw)
In [13]:
# R LANG::
summary(consumpt_df$temperature)
First, we create a new attribute ('kWh') in the 'consumpt_df' DataFrame, to accomodate the Energy Consumption made by the retail store. Since the provided data set has been sampled every single hour, this new 'kWh' attribute will be equal-valued with the provided 'kw' one. We also change the colname 'temperature' to something sorter for our convenience.
In [14]:
# R LANG::
head(consumpt_df, n=10)
In [15]:
# R LANG::
consumpt_df$kWh = consumpt_df$kw
names(consumpt_df)[4] <- 'Temp'
names(consumpt_df)[5] <- 'Traffic'
head(consumpt_df)
In [16]:
# R LANG::
grid.newpage()
pushViewport(viewport(layout = grid.layout(2,1)))
# "Energy consumption" vs "Retail Store Traffic"
f1 <- (ggplot(consumpt_df) +
geom_boxplot(aes(Traffic, kWh, colour=Traffic), alpha=0.75) +
labs(title='Energy Consumption per Traffic Status', xlab='Retail Store Traffic', ylab='kWh'))
# "Energy Consumption" vs "Temperature"
f2 <- (ggplot(consumpt_df) +
geom_point(aes(Temp, kWh)) +
geom_smooth(aes(Temp,kWh), colour='red', alpha=0.75) +
labs(title='Energy Consumption vs Temperature', x='Temp (\u00b0C)', y='kWh'))
# Print the plots in the pre-defined Viewport
print(f1, vp=viewport(layout.pos.row = 1, layout.pos.col = 1))
print(f2, vp=viewport(layout.pos.row = 2, layout.pos.col = 1))
dev.off()
By resampling the Energy Consumption Timeseries on a Daily Basis, we can have a better understanding of its time evolution and its dependence / correlation with the corresponding Temperature Timeseries. The rolling moving averages in 5 and in 30 calendar days, MA5 and MA30, are also depicted for both series. Note that the standard deviation of the 'kWh' series greatly varies taking values from 0 to ~750 kWh, and appears peak during the hot summer months and the last few days of 2014.
In [17]:
# R LANG::
head(consumpt_df)
In [18]:
# R LANG::
# Create the time series of interest:
# (1) Energy Consumption TimeSeries ["KWh"]
kWh <- ts(consumpt_df$kWh,
start = c(2014,1),
end = c(2014,nrow(consumpt_df)),
frequency=24*365)
# (2) External Temperature TimeSeries ["Temp"]
Temp <- ts(consumpt_df$Temp,
start=c(2014,1),
end=c(2014,nrow(consumpt_df)),
frequency=24*365)
# (3) Traffic Status TimeSeries ["Traffic"]
consumpt_df['Date'] <- as.Date(consumpt_df$timestamp)
TrafficFreq <- (consumpt_df %>%
group_by(Date, Traffic) %>%
count(Traffic) %>%
filter(Traffic != 'closed') %>%
filter(n==max(n)) %>%
filter(as.integer(Traffic)==max(as.integer(Traffic))))
rTraffic <- ts(TrafficFreq$Traffic,
start = c(2014,1),
end = c(2014,nrow(TrafficFreq)),
frequency=365)
# R LANG::
# Energy Consumption & Temperature TimeSeries plots
par(mfrow=c(2,1))
# Resample TimeSeries of interest on a Daily Basis
rkWh <- aggregate(kWh, nfrequency = 365, FUN = sum)
rTemp <- aggregate(Temp, nfrequency = 365, FUN = mean)
# Cbind the TimeSeries of interest in a DataFrame
ragg <- data.frame(cbind(date=index(rkWh),
rkwh=melt(rkWh)$value,
rkwh_ma5=melt(rollmean(rkWh,5))$value,
rkwh_ma30=melt(rollmean(rkWh,30))$value,
rkwh_std5=melt(rollapply(data=rkWh, width=5, FUN=sd, fill=NA))$value,
rtemp=melt(rTemp)$value,
rtemp_ma5=melt(rollmean(rTemp,5))$value,
rtemp_ma30=melt(rollmean(rTemp,30))$value))
In [19]:
# R LANG:: [ggplot2+]
# Energy Consumption & Temperature TimeSeries plots
grid.newpage()
pushViewport(viewport(layout = grid.layout(3,1, widths=unit(1,'npc'), heights=unit(1/3,'npc'))))
fig1 <- (ggplot(ragg, aes(date)) +
geom_line(aes(y=rkwh, colour='rkWh')) +
geom_line(aes(y=rkwh_ma5, colour='MA5')) +
geom_line(aes(y=rkwh_ma30, colour='MA30')) +
theme(legend.position = c(0.9,0.9), legend.justification=c(0.5, 1),
legend.background = element_blank()) +
labs(title='Energy Consumption (kWh) [2014]', x='Months in 2014', y='kWh',
colour='Series') +
scale_x_yearmon())
fig2 <- (ggplot(ragg, aes(date)) + #, width=14, height=6, units='cm') +
geom_line(aes(y=rkwh_std5, colour='Std5')) +
theme(legend.position = c(0.9,0.9), legend.justification = c(0.5,1),
legend.background = element_blank()) +
labs(title='Standard Deviation of Energy Consumption (kWh) [2014]', x='Months in 2014', y='(+/-) kWh',
colour='Series') +
scale_x_yearmon())
fig3 <- (ggplot(ragg, aes(date)) + #, width=14, height=6, units='cm') +
geom_line(aes(y=rtemp, colour='rTemp')) +
geom_line(aes(y=rtemp_ma5, colour='MA5')) +
geom_line(aes(y=rtemp_ma30, colour='MA30')) +
theme(legend.position = c(0.9,0.9), legend.justification=c(0.5, 1),
legend.background = element_blank()) +
labs(title='External Temperature (\u00b0C) [2014]', x='Months in 2014', y='Temp (\u00b0C)',
colour='Series') +
scale_x_yearmon())
# Print the plots in the pre-defined Viewport
print(fig1, vp=viewport(layout.pos.row = 1, layout.pos.col = 1))
print(fig2, vp=viewport(layout.pos.row = 2, layout.pos.col = 1))
print(fig3, vp=viewport(layout.pos.row = 3, layout.pos.col = 1))
dev.off()
'kWh' Timeseries: Univariate analysis and autoregression model adequacyAs a next step, we try to figure out if we can study the Energy Consumption Timeseries ('kWh') alone, and if some autorgression model will be adequate enough to provide predictions.
As shown below, there is no easy way to figure out the autoregression model that might fit the data well. The Energy Consumption Timeseries ('kWh') appears to have huge variability, especially during the summer months, and there is no easy way to absorb it through a suitable transformation. Even if we detrend the provided timeseries, the results we obtain are not great.
In [20]:
# R LANG::
fit <- lm(rkWh ~ time(rkWh), na.action = NULL)
rkWh2 <-resid(fit) # Detrended rkWh
tempDF <- data.frame(cbind(date=index(rkWh2),
rkwh2=melt(rkWh2)$value))
fig <- (ggplot(tempDF, aes(date)) +
geom_line(aes(y=rkwh2, colour='kWh')) +
theme(legend.position = c(0.9,0.9), legend.justification = c(0.5,1),
legend.background = element_blank()) +
labs(title='Detrended Energy Consumption (kWh) [2014]', x='Months in 2014', y='Detrended kWh',
colour='Series') +
scale_x_yearmon())
print(fig)
In [21]:
# R LANG::
# ACF/PACF Plots
acf <- acf2(rkWh2, max.lag=52) # Suggests diff(x, lag=5, differences=1)
acf <- acf2(diff(rkWh2, lag=7, differences = 1), max.lag=52) # Suggests ???
#acf <- acf2(diff(diff(rkWh2, lag=7, differences = 1)), max.lag=52) # Suggests ???
'kWh' ~ 'Temp' & 'Traffic' dependenceAs stated in the part 3 of this analysis the correlation between the 'kWh' and 'Temp' timeseries is apparent, whereas a quadratic dependence of "kWh" from the corresponding "Temp" values is more appropriate to consider. In addition, the energy consumption obviously depends on the traffic status of the store but probably not strongly.
In order to incorporate these dependences in our model, we are going to fit a regression model with autocorrelated errors. To do so we will follow an algorithm which has been introduced by Cochrane and Orcutt (1949), and it is described in the "Time Series Analysis and Its Applications: With R Examples" (Springer Texts in Statistics), by R. H. Shumway and D. S. Stoffer.
More specificaly, assuming the regression model below:
$$ y_{t} = \mathbf{\beta}\mathbf{z_{t}} + x_{t} $$where $\mathbf{\beta}$ is an $r\times 1$ vector of regression parameters, $\mathbf{z_{t}}$ an $r\times 1$ vector of regressors (exogeneous variables), and $x_{t}$ a process with some covariance function $\gamma(s,t)$ we will:
Run an ordinary regression fit of $y_{t}$ ("KWh") on $x_{t}$ ("Temp") (acting as if the errors are uncorrelated), and retain the residuals.
Fit an ARMA Model to the residuals $\widehat{x_{t}}=y_{t}-\widehat{\mathbf{\beta}}\mathbf{z_{t}}$, say $$ \widehat{\phi}(B)\widehat{x_{t}} = \widehat{\theta}(B)w_{t} $$
Apply the ARMA transformation to both sides of the initial model in order to obtain the transformed regression one. $$ u_{t} = \frac{\widehat{\phi}(B)}{\widehat{\theta}(B)}y_{t}\quad\text{and}\quad \mathbf{v_{t}}=\frac{\widehat{\phi}(B)}{\widehat{\theta}(B)}\mathbf{z_{t}} $$
Run an ordinary least squares regression assuming uncorrelated errors on the transformed regression model $$ u_{t} = \mathbf{\beta}\mathbf{v_{t}} + w_{t}\,, $$ where $w_{t}$ is expected now to be... white noise!
In [22]:
# R LANG::
# Regression Models To Better Investigate
rkwh_trend <- time(rkWh)
rtemp_demean <- rTemp - mean(rTemp)
rtemp_demeanSq <- rtemp_demean^2
rtemp_demeanCb <- rtemp_demean^3
fit1 <- lm(rkWh ~ rkwh_trend + rtemp_demean, na.action=NULL)
fitSq1 <- lm(rkWh ~ rkwh_trend + rtemp_demean + rtemp_demeanSq , na.action=NULL)
fitCb1 <- lm(rkWh ~ rkwh_trend + rtemp_demean + rtemp_demeanSq + rtemp_demeanCb, na.action=NULL)
fit2 <- lm(rkWh ~ rkwh_trend + rtemp_demean + rTraffic, na.action=NULL)
fitSq2 <- lm(rkWh ~ rkwh_trend + rtemp_demean + rtemp_demeanSq + rTraffic, na.action=NULL)
fitCb2 <- lm(rkWh ~ rkwh_trend + rtemp_demean + rtemp_demeanSq + rtemp_demeanCb + rTraffic, na.action=NULL)
fitLoess1 <- loess(rkWh ~ rkwh_trend + rtemp_demean, model=TRUE)
fitGam1 <- gam((rkWh ~ s(rtemp_demean, bs='gp')
+ ti(rkwh_trend, rtemp_demean, by=rTraffic, bs='gp')))
fitGam2 <- gam((rkWh ~ rTraffic + s(rtemp_demean, bs='gp')
+ ti(rkwh_trend, rtemp_demean, by=rTraffic, bs='gp')))
# R LANG::
# ANOVA Results & Statistics Information Criteria
# Regression Models Summary
models <- list(fit1, fitSq1, fitCb1, fit2, fitSq2, fitCb2, fitLoess1, fitGam1, fitGam2)
modelnames <- list('fit1', 'fitSq1', 'fitCb1', 'fit2', 'fitSq2', 'fitCb2', 'fitLoess1', 'fitGam1', 'fitGam2')
writeLines('Analysis of Variance & Statistics Information Criteria:\n')
for(i in seq(1,length(models))) {
s = paste0('Model ', modelnames[[i]], ':')
writeLines(s)
writeLines(paste0(strrep('-', nchar(s) + 3), '\n'))
print(summary(models[[i]]))
writeLines('\n')
}
# ANOVA Results
print(anova(fit1,fitSq1,fitCb1,fit2,fitSq2,fitCb2,fitGam1,fitGam2))
# Statistics Information Criteria
df <- AIC(fit1,fitSq1,fitCb1,fit2,fitSq2,fitCb2,fitGam1,fitGam2)
df[order(df$AIC),]
df <- BIC(fit1,fitSq1,fitCb1,fit2,fitSq2,fitCb2,fitGam1,fitGam2)
df[order(df$BIC),]
In [23]:
summary(fitGam1)
In [24]:
summary(fitSq1)
From the ANOVA results shown above, the statistics information criteria (AIC, BIC), and the variance explained by each of the models, we choose the fitGam1 model as the one that fits our data best, with the more simpler ones, fitCb1 and fitSq1, coming immediately after. Due to the relatively small data size, we have assumed that the AIC results are more reliable for our problem. An important advantage of the fitGam1 model over the other two, is that incorporate nicely the energy consumption dependence on the external temperature and on the traffic in store through appropriately configured interactions, i.e. the
rkWh ~ s(rtemp_demean, bs = "gp") + ti(rkwh_trend, rtemp_demean, by = rTraffic, bs = "gp"),
explains 81.7% of deviance in the provided data, and as we will show below, it provides residuals which can be predicted by an appropriate autoregression model extremely well! On the contrary, the residuals of the fitCb1 model are not very helpful to figure out an appropriate autoregression model to fit. Consequently, we will choose only the fitGam1 and fitSq1 models to further investigate.
fitGam1
In [25]:
# R LANG::
summary(fitGam1)
plot.ts(resid(fitGam1))
acf <- acf2(resid(fitGam1), max.lag = 52) # Suggests diff(x,lag=7, differences=1)
acf <- acf2(diff(resid(fitGam1), lag=7, differences = 1), max.lag = 52) # Suggests [ARMA(?,?) x ARMA(0,2)_{7}]
acf <- acf2(diff(resid(fitGam1), lag=7, differences = 1), max.lag = 7) # Suggests [ARMA(1,3(0)) x ARMA(0,2)_{7}]
plot.ts(diff(resid(fitGam1), lag=7, differences = 1))
AR and MA ordersThe previous ACF/PACF plots suggest that an $\mathbf{ARIMA(1,0,3)\,\times\,(0,1,2)_{7}}$ model will provide a really good fit to our data, probably with some variations in the non-seasonal AR and MA orders.
In the few lines of code below we examine this fit, the results we obtain from the 'arima' R routine, as well as slight variations of this model which are suggested by the t-Statistics of its calculated coefficients. We conclude that the $\mathbf{ARIMA(1,0,2)\,\times\,(0,1,1)_{7}}$ model, fitSq1_SARIMA2, fits our data best. Note that the design matrix which was used during the fitting of the fitSq1 model,
XReg <- fitGam1$linear.predictors,
has been plugged-in as a matrix of external regressors (exogenous variables), in the new ARIMA fits.
In [26]:
# R LANG::
# [ARIMA(1,0,3(0)) x ARIMA(0,1,2)_{7}]
summary(fitGam1)
# Design Matrix
XReg <- fitGam1$linear.predictors
# Suggested by the ACF/PACF Plots
sarima(rkWh,1,0,3,P=0,D=1,Q=2,S=7, xreg = XReg)
fitGam1_SARIMA1 <- arima(rkWh, order=c(1,0,3), seasonal = list(order=c(0,1,2), period=7), xreg=XReg)
# Suggested by the t-Statistics of the previous model:
sarima(rkWh,1,0,2,P=0,D=1,Q=1,S=7, xreg = XReg)
fitGam1_SARIMA2 <- arima(rkWh, order=c(1,0,2), seasonal = list(order=c(0,1,1), period=7), xreg=XReg)
From the diagnostic diagrams above, and especially the ones that corresponds to the best of our models, $\mathbf{ARIMA(1,0,2)\,\times\,(0,1,1)_{7}}$ (fitGam1_SARIMA2), we observe that no significant outliers exist in times series (Standardized Residuals timeseries plot), neither significant departure from normality (Normal Q-Q Plot of Residuals), except of course during the last few days of 2014. Furthermore, the ACF Plot of Residuals does not reveal any autocorrelation, and the Q-Statistics is never significant at the lags shown, supporting that way the whiteness of the residuals.
In [27]:
# R LANG::
summary(fitSq1)
plot.ts(resid(fitSq1))
acf <- acf2(resid(fitSq1), max.lag = 52) # Suggests diff(x,lag=7, differences=1)
acf <- acf2(diff(resid(fitSq1), lag=7, differences = 1), max.lag = 52) # Suggests [ARMA(?,?) x ARMA(0,2)_{7}]
acf <- acf2(diff(resid(fitSq1), lag=7, differences = 1), max.lag = 7) # Suggests [ARMA(1,0(3)) x ARMA(0,2)_{7}]
plot.ts(diff(resid(fitSq1), lag=7, differences = 1))
AR and MA ordersThe previous ACF/PACF plots suggest that an $\mathbf{ARIMA(1,0,3)\,\times\,(0,1,2)_{7}}$ model will provide a really good fit to our data, probably with some variations in the non-seasonal AR and MA orders.
In the few lines of code below we examine this fit, the results we obtain from the 'arima' R routine, as well as slight variations of this model which are suggested by the t-Statistics of its calculated coefficients. We conclude that the $\mathbf{ARIMA(1,0,2)\,\times\,(0,1,1)_{7}}$ model, fitSq1_SARIMA2, fits our data best. Note that the design matrix which was used during the fitting of the fitSq1 model,
XReg <- cbind(rkwh_trend, rtemp_demean, rtemp_demeanSq),
has been plugged-in as a matrix of external regressors (exogenous variables), in the new ARIMA fits.
In [28]:
# R LANG::
# [ARIMA(1,0,0(3)) x ARIMA(0,1,2)_{7}]
summary(fitSq1)
# Design Matrix
XReg <- cbind(rkwh_trend, rtemp_demean, rtemp_demeanSq)
# Suggested by the ACF/PACF Plots
sarima(rkWh,1,0,3,P=0,D=1,Q=2,S=7, xreg = XReg)
fitSq1_SARIMA1 <- arima(rkWh, order=c(1,0,3), seasonal = list(order=c(0,1,2), period=7), xreg=XReg)
# Suggested by the t-Statistics of the previous model:
sarima(rkWh,1,0,2,P=0,D=1,Q=1,S=7, xreg = XReg)
fitSq1_SARIMA2 <- arima(rkWh, order=c(1,0,2), seasonal = list(order=c(0,1,1), period=7), xreg=XReg)
From the diagnostic diagrams above, and especially the ones that corresponds to the best of our models, $\mathbf{ARIMA(1,0,2)\,\times\,(0,1,1)_{7}}$ (fitSq1_SARIMA2), we observe that no significant outliers exist in times series (Standardized Residuals timeseries plot), neither significant departure from normality (Normal Q-Q Plot of Residuals), except of course during the early few days of 2014 and the last ones of the same year. Furthermore, the ACF Plot of Residuals does not reveal any autocorrelation, and the Q-Statistics is never significant at the lags shown, supporting that way the whiteness of the residuals.
fitGam1_SARIMA2 vs fitSq1_SARIMA2
In [29]:
# R LANG::
# fitGam1_SARIMA2
writeLines('\'fitGam1_SARIMA2\' Model:')
fitGam1_SARIMA2
tsdiag(fitGam1_SARIMA2)
In [30]:
# R LANG::
# fitSq1_SARIMA2
writeLines('\'fitSq1_SARIMA2\' Model:')
fitSq1_SARIMA2
tsdiag(fitSq1_SARIMA2)
In [31]:
# R LANG::
AIC(fitGam1_SARIMA2, fitSq1_SARIMA2)
BIC(fitGam1_SARIMA2, fitSq1_SARIMA2)
From the diagnostics diagrams above and the Statistics Information Criteria, AIC/BIC, we can easily verify that the fitGam1_SARIMA2 model is supperior over the other, simpler one, fitSq1_SARIMA2.
In the next section, we are going to use the fitGam1_SARIMA2 model to provide short and long-term forecasts. The results are indeed great!
'kWh'): comparison with FB 'prophet' predictionsFinally, we provide an Energy Consumption Forecast ('kWh') based on this last model, fitGam1_SARIMA2, for the next 60 and 90 calendar days and for a long-term period. Note that our model have great predictive performance both for short and long-term time forecasts.
In another front, we compare our results with the one returned by the FB prophet library, as it was trained over the 'rkWh' timeseries alone. We find that this model also provide good results, although shifted towards greater values of forecasted consumed energy than the one had during the same periods in 2014.
'prophet' library over the 'rkWh' timeseries
In [32]:
# R LANG::
df <- data_frame(ds=seq(as.Date('2014-01-01'), as.Date('2014-12-31'), by='d'),
y=as.numeric(melt(rkWh)$value))
m <- prophet(df)
In [33]:
# R LANG::
# Design Matrix
XReg <- fitGam1$linear.predictors
# Energy Consumption Forecasts for the next 60 and 90 calendar days
grid.newpage()
pushViewport(viewport(layout = grid.layout(2,1, widths=unit(1,'npc'), heights=unit(1/2,'npc'))))
# Next Two Months [JAN-FEB (2015)]
nAhead = 31+28
fore <- predict(fitGam1_SARIMA2, n.ahead = nAhead, newxreg = XReg, prediction.interval = TRUE)
future <- make_future_dataframe(m, periods=nAhead+1)
foreFB <- predict(m, future)
# DataFrame of Actual and Predicted Energy Consumption (kWh)
predDF <- data.frame(cbind(date=c(index(rkWh),index(fore$pred)[1: nAhead]),
rkwh=c(melt(rkWh)$value, rep(NA,nAhead)),
pred=c(rep(NA,length(index(rkWh))), melt(fore$pred)$value[1: nAhead]),
se=c(rep(NA, length(index(rkWh))), melt(fore$se)$value[1: nAhead]),
yhat=foreFB$yhat,
yhat_lower=foreFB$yhat_lower,
yhat_upper=foreFB$yhat_upper))
fig1 <- (ggplot(predDF, aes(date)) +
geom_point(aes(y=rkwh, colour='rkWh'), colour="black") +
geom_line(aes(y=pred, colour='R Forecast')) +
geom_ribbon(aes(ymin = pred-se, ymax = pred+se),
alpha = 0.3,
fill= '#009E73',
na.rm = TRUE) +
geom_line(aes(y=yhat, colour='FB prophet')) +
geom_ribbon(aes(ymin = yhat_lower, ymax = yhat_upper),
alpha = 0.3,
fill = '#0072B2',
na.rm = TRUE) +
scale_color_manual(name='',
values = c('R Forecast'='#009E73','FB prophet'='#0072B2')) +
theme(legend.position = c(0.9,0.9), legend.justification=c(0.5, 0.5),
legend.background = element_blank()) +
labs(title='Energy Consumption Forecast (kWh)\n[next two months]',
x='Months in 2014 (2015)', y='kWh') +
scale_x_yearmon())
# Next Three Months [JAN-MAR (2015)]
nAhead = 31+28+31
fore <- predict(fitGam1_SARIMA2, n.ahead = nAhead, newxreg = XReg, prediction.interval = TRUE)
future <- make_future_dataframe(m, periods=nAhead+1)
foreFB <- predict(m, future)
# DataFrame of Actual and Predicted Energy Consumption (kWh)
predDF <- data.frame(cbind(date=c(index(rkWh),index(fore$pred)[1: nAhead]),
rkwh=c(melt(rkWh)$value, rep(NA,nAhead)),
pred=c(rep(NA,length(index(rkWh))), melt(fore$pred)$value[1: nAhead]),
se=c(rep(NA, length(index(rkWh))), melt(fore$se)$value[1: nAhead]),
yhat=foreFB$yhat,
yhat_lower=foreFB$yhat_lower,
yhat_upper=foreFB$yhat_upper))
fig2 <- (ggplot(predDF, aes(date)) +
geom_point(aes(y=rkwh, colour='rkWh'), colour="black") +
geom_line(aes(y=pred, colour='R Forecast')) +
geom_ribbon(aes(ymin = pred-se, ymax = pred+se),
alpha = 0.3,
fill= '#009E73',
na.rm = TRUE) +
geom_line(aes(y=yhat, colour='FB prophet')) +
geom_ribbon(aes(ymin = yhat_lower, ymax = yhat_upper),
alpha = 0.3,
fill = '#0072B2',
na.rm = TRUE) +
scale_color_manual(name='',
values = c('R Forecast'='#009E73','FB prophet'='#0072B2')) +
theme(legend.position = c(0.9,0.9), legend.justification=c(0.5, 0.5),
legend.background = element_blank()) +
labs(title='Energy Consumption Forecast (kWh)\n[next three months]',
x='Months in 2014 (2015)', y='kWh') +
scale_x_yearmon())
# Print the plots in the pre-defined Viewport
print(fig1, vp=viewport(layout.pos.row = 1, layout.pos.col = 1))
print(fig2, vp=viewport(layout.pos.row = 2, layout.pos.col = 1))
dev.off()
In [34]:
# R LANG::
# Design Matrix
XReg <- fitGam1$linear.predictors
# Energy Consumption Forecasts for the next 60 and 90 calendar days
grid.newpage()
pushViewport(viewport(layout = grid.layout(3,1, widths=unit(1,'npc'), heights=unit(1/3,'npc'))))
# Next Semester (2015)
nAhead = 31+28+31+30+31+30
fore <- predict(fitGam1_SARIMA2, n.ahead = nAhead, newxreg = XReg, prediction.interval = TRUE)
future <- make_future_dataframe(m, periods=nAhead+1)
foreFB <- predict(m, future)
# DataFrame of Actual and Predicted Energy Consumption (kWh)
predDF <- data.frame(cbind(date=c(index(rkWh),index(fore$pred)[1: nAhead]),
rkwh=c(melt(rkWh)$value, rep(NA,nAhead)),
pred=c(rep(NA,length(index(rkWh))), melt(fore$pred)$value[1: nAhead]),
se=c(rep(NA, length(index(rkWh))), melt(fore$se)$value[1: nAhead]),
yhat=foreFB$yhat,
yhat_lower=foreFB$yhat_lower,
yhat_upper=foreFB$yhat_upper))
fig1 <- (ggplot(predDF, aes(date)) +
geom_point(aes(y=rkwh, colour='rkWh'), colour="black") +
geom_line(aes(y=pred, colour='R Forecast')) +
geom_ribbon(aes(ymin = pred-se, ymax = pred+se),
alpha = 0.3,
fill= '#009E73',
na.rm = TRUE) +
geom_line(aes(y=yhat, colour='FB prophet')) +
geom_ribbon(aes(ymin = yhat_lower, ymax = yhat_upper),
alpha = 0.3,
fill = '#0072B2',
na.rm = TRUE) +
scale_color_manual(name='',
values = c('R Forecast'='#009E73','FB prophet'='#0072B2')) +
theme(legend.position = c(0.9,0.9), legend.justification=c(0.5, 0.5),
legend.background = element_blank()) +
labs(title='Energy Consumption Forecast (kWh)\n[Next Semester]',
x='Months in 2014 (2015)', y='kWh') +
scale_x_yearmon())
# Next Year (2015)
nAhead = 365
fore <- predict(fitGam1_SARIMA2, n.ahead = nAhead, newxreg = XReg, prediction.interval = TRUE)
future <- make_future_dataframe(m, periods=nAhead+1)
foreFB <- predict(m, future)
# DataFrame of Actual and Predicted Energy Consumption (kWh)
predDF <- data.frame(cbind(date=c(index(rkWh),index(fore$pred)[1: nAhead]),
rkwh=c(melt(rkWh)$value, rep(NA,nAhead)),
pred=c(rep(NA,length(index(rkWh))), melt(fore$pred)$value[1: nAhead]),
se=c(rep(NA, length(index(rkWh))), melt(fore$se)$value[1: nAhead]),
yhat=foreFB$yhat,
yhat_lower=foreFB$yhat_lower,
yhat_upper=foreFB$yhat_upper))
fig2 <- (ggplot(predDF, aes(date)) +
geom_point(aes(y=rkwh, colour='rkWh'), colour="black") +
geom_line(aes(y=pred, colour='R Forecast')) +
geom_ribbon(aes(ymin = pred-se, ymax = pred+se),
alpha = 0.3,
fill= '#009E73',
na.rm = TRUE) +
geom_line(aes(y=yhat, colour='FB prophet')) +
geom_ribbon(aes(ymin = yhat_lower, ymax = yhat_upper),
alpha = 0.3,
fill = '#0072B2',
na.rm = TRUE) +
scale_color_manual(name='',
values = c('R Forecast'='#009E73','FB prophet'='#0072B2')) +
theme(legend.position = c(0.9,0.9), legend.justification=c(0.5, 0.5),
legend.background = element_blank()) +
labs(title='Energy Consumption Forecast (kWh)\n[Next Year]',
x='Months in 2014 (2015)', y='kWh') +
scale_x_yearmon())
# Next Year + One Month (2015, 2016)
nAhead = 365 + 31
fore <- predict(fitGam1_SARIMA2, n.ahead = nAhead, newxreg = XReg, prediction.interval = TRUE)
future <- make_future_dataframe(m, periods=nAhead+1)
foreFB <- predict(m, future)
# DataFrame of Actual and Predicted Energy Consumption (kWh)
predDF <- data.frame(cbind(date=c(index(rkWh),index(fore$pred)[1: nAhead]),
rkwh=c(melt(rkWh)$value, rep(NA,nAhead)),
pred=c(rep(NA,length(index(rkWh))), melt(fore$pred)$value[1: nAhead]),
se=c(rep(NA, length(index(rkWh))), melt(fore$se)$value[1: nAhead]),
yhat=foreFB$yhat,
yhat_lower=foreFB$yhat_lower,
yhat_upper=foreFB$yhat_upper))
fig3 <- (ggplot(predDF, aes(date)) +
geom_point(aes(y=rkwh, colour='rkWh'), colour="black") +
geom_line(aes(y=pred, colour='R Forecast')) +
geom_ribbon(aes(ymin = pred-se, ymax = pred+se),
alpha = 0.3,
fill= '#009E73',
na.rm = TRUE) +
geom_line(aes(y=yhat, colour='FB prophet')) +
geom_ribbon(aes(ymin = yhat_lower, ymax = yhat_upper),
alpha = 0.3,
fill = '#0072B2',
na.rm = TRUE) +
scale_color_manual(name='',
values = c('R Forecast'='#009E73','FB prophet'='#0072B2')) +
theme(legend.position = c(0.9,0.9), legend.justification=c(0.5, 0.5),
legend.background = element_blank()) +
labs(title='Energy Consumption Forecast (kWh)\n[Next Year + one Month]',
x='Months in 2014 (2015, 2016)', y='kWh') +
scale_x_yearmon())
# Print the plots in the pre-defined Viewport
print(fig1, vp=viewport(layout.pos.row = 1, layout.pos.col = 1))
print(fig2, vp=viewport(layout.pos.row = 2, layout.pos.col = 1))
print(fig3, vp=viewport(layout.pos.row = 3, layout.pos.col = 1))
dev.off()
The regression model with autocorrelated errors we fit previously, namely fitGam1_SARIMA2, can provide accurate short and long-term forecasts.
More specifically, the simulated time series of energy consumption which is returned by the model is surprisingly similar with the one provided, except of course some peaks during the hot summer days and the New Year's Eve. At these times the forecasted consumption is significantly lower than the one had during the same periods in 2014. The reason for this deficiency is two-fold. First, the 'Temp' timeseries and its contribution in the energy consumption can be better accomodated by a better initial regression model. Secondly, the specific energy consumption peak around New Year's Eve, was treated as an outlier by our model. Having historical data of two and more years would allow us to better incorporate these seasonalities in our model and hopefully provide even better predictions.
As a final remark, I expect that this kind of timeseries ('kWh') and its dependence from quantitative ('Temp') or categorical variables ('Traffic') can be also studied by utilizing other time domain methods or state-space practices. On the other hand, the results returned by the FB prophet library, as it was trained over the 'rkWh' timeseries alone, were also good but shifted towards greater values of forecasted consumed energy than the one had during the same periods in 2014. We believe that having the ability to incorporate the 'kWh' ~ 'Temp' & 'Traffic' dependence in this model, would make it much better.
In [ ]: